The diameter of the world wide web 



Despite its increasing role in communication, the world wide web (www) remains the 
least controlled medium: any individual or institution can create websites with an unrestricted
number of documents and links. This unregulated growth leads to a huge and complex
web, which is a large directed graph, whose vertices are documents and edges are the links 
(URLs) pointing from one document to another. The topology of this graph determines 
the web's connectivity and, consequently, our effectiveness in locating information on the 
www. However, due to its large size (estimated to be at least $8 \times 10^8$ documents [1]), and
the continuously changing documents and links, it is impossible to catalogue all vertices and 
edges. The challenge in obtaining a complete topological map of the www is illustrated by 
the limitations of the commercial search engines: Northern Light, the search engine with the 
largest coverage, is estimated to index only 38% of the web [1]. While great efforts are made
to map and characterize the Internet's infrastructure [2], little is known about what truly
matters in searching for information, i.e., about the topology of the www. Here we take a 
first step to fill this gap: we use local connectivity measurements to construct a topological 
model of the www, allowing us to explore and characterize the large scale properties of the 
web. 

To determine the local connectivity of the www, we constructed a robot that adds
to its database all URLs found on a document and recursively follows these to retrieve the
related documents and URLs. From the collected data we determined the probability $P_{\mathrm{out}}(k)$
($P_{\mathrm{in}}(k)$) that a document has $k$ outgoing (incoming) links. As Figs. 1a and b illustrate, we
find that both $P_{\mathrm{out}}(k)$ and $P_{\mathrm{in}}(k)$ follow a power law over several orders of magnitude,
remarkably different not only from the Poisson distribution predicted by the classical theory
of random graphs by Erdős and Rényi [3,4], but also from the bounded distribution found in
recent models of random networks [5]. The power-law tail indicates that the probability of
finding documents with a large number of links is rather significant, the network connectivity
being dominated by highly connected web pages. The same is true for the incoming links: the
probability of finding very popular addresses, to which a large number of other documents 
point, is non-negligible, an indication of the flocking sociology of the www. Furthermore, 
while the owner of each web page has complete freedom in choosing the number of links on 
a document and the addresses to which they point, the overall system obeys scaling laws 
characteristic only of highly interactive self-organized systems and critical phenomena [6].
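
As a minimal sketch of how $P_{\mathrm{out}}(k)$ and $P_{\mathrm{in}}(k)$ can be tabulated from crawl data, the following Python fragment counts links per document in an edge list. The function name `degree_distributions` and the toy edge list are hypothetical and are not taken from the robot described above; documents are assumed to be identified by unique strings.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (P_out, P_in) for a directed edge list of (source, target) pairs.

    Each result maps a link count k to the fraction of documents that have
    exactly k outgoing (respectively incoming) links.
    """
    nodes = set()
    out_deg = Counter()
    in_deg = Counter()
    for src, dst in edges:
        nodes.update((src, dst))
        out_deg[src] += 1
        in_deg[dst] += 1

    n = len(nodes)

    def normalize(deg):
        # A Counter returns 0 for missing keys, so documents with no
        # outgoing (or incoming) links are counted under k = 0.
        counts = Counter(deg[v] for v in nodes)
        return {k: c / n for k, c in sorted(counts.items())}

    return normalize(out_deg), normalize(in_deg)


# Hypothetical toy crawl: four documents, four links.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
p_out, p_in = degree_distributions(edges)
print(p_out)  # {0: 0.25, 1: 0.5, 2: 0.25}
print(p_in)   # {0: 0.5, 1: 0.25, 3: 0.25}
```

On real crawl data the tail of such a histogram is usually noisy, so some form of binning may be needed before plotting on logarithmic axes; the source does not state which procedure was used.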

To investigate the connectivity and the large-scale topological properties of the www, 
we construct a directed random graph consisting of $N$ vertices, assigning to each vertex $k$
outgoing (incoming) links, such that $k$ is drawn from the power-law distribution shown in
Figs. 1a and b. To achieve this, we randomly select a vertex $i$ and increase its outgoing (incoming)
connectivity to $k_i + 1$ if the total number of vertices with $k_i + 1$ outgoing (incoming)
links is less than $N P_{\mathrm{out}}(k_i + 1)$ ($N P_{\mathrm{in}}(k_i + 1)$). A particularly important quantity in a search
process is the shortest path between two documents, $d$, defined as the smallest number
of URL links one needs to follow to navigate from one document to the other. As Fig. 1c
shows, we find that the average of $d$ over all pairs of vertices follows $\langle d \rangle = 0.35 + 2.06 \log(N)$,
indicating that the web forms a small-world network [5,7], known to characterize social or
biological systems. Using $N = 8 \times 10^8$ [1], we find $\langle d_{\mathrm{www}} \rangle = 18.59$, i.e., two randomly chosen
documents on the web are on average 19 clicks away from each other. Since for a given
$N$, $d$ follows a Gaussian distribution, $\langle d \rangle$ can be interpreted as the diameter of the web, a
measure of the shortest distance between any two points in the system. Despite its huge 
size, our results indicate that the www is a highly connected graph of average diameter of 
only 19 links. The logarithmic dependence of $\langle d \rangle$ on $N$ is important to the future potential
of the www: we find that the expected 1000% increase in the size of the web over the next
few years will change $\langle d \rangle$ from 19 to only 21. The relatively small value of $d$ suggests that
an intelligent agent, i.e., one that can interpret the links and follow only the relevant ones, can
find the desired information in a short time by navigating the www. However, this is not the
case for a robot that locates the information based on matching strings: we find that such
a robot, aiming to identify a document at distance $\langle d \rangle$, needs to search $M(\langle d \rangle) \approx 0.53 N^{0.92}$
documents which, using $N = 8 \times 10^8$ [1], leads to $M = 8 \times 10^7$, i.e., to 10% of the full www.
This indicates that robots cannot benefit from the highly connected nature of the web, their 
only successful strategy being indexing as large a fraction of the www as possible. 
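
The two quantities discussed above can be sketched in a few lines of Python. The first part computes the shortest-path distance $d$ by breadth-first search on a directed graph and averages it over reachable ordered pairs. The second part evaluates the fitted relations for $\langle d \rangle$ and $M$. All function names and the three-document toy graph are hypothetical. The base-10 logarithm is an assumption, chosen because it reproduces the quoted values to within rounding: evaluating the fit at $N = 8 \times 10^8$ gives about 18.7, slightly above the quoted 18.59, presumably because the coefficients are rounded.

```python
import math
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: number of links from `source` to each reachable document."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_distance(adj):
    """Mean of d over ordered pairs (u, v), u != v, where v is reachable from u.

    Assumes every document appears as a key of `adj`, with an empty list
    if it has no outgoing links.
    """
    total = 0
    pairs = 0
    for u in adj:
        for v, d in shortest_path_lengths(adj, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")


def predicted_mean_distance(n):
    """Fitted relation <d> = 0.35 + 2.06 log10(N); base 10 is an assumption."""
    return 0.35 + 2.06 * math.log10(n)


def robot_search_cost(n):
    """Fitted relation M ~ 0.53 N^0.92 for a string-matching robot."""
    return 0.53 * n ** 0.92


# Hypothetical toy web: a directed 3-cycle a -> b -> c -> a.
toy_web = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(average_distance(toy_web))                 # 1.5

print(round(predicted_mean_distance(8e8)))       # 19
print(round(predicted_mean_distance(8e9)))       # 21 (tenfold larger web)
print(round(robot_search_cost(8e8) / 8e8, 2))    # 0.1, i.e. about 10% of the www
```

The all-pairs average costs one breadth-first search per document, which is practical for a domain-sized graph but not for the full web; that is the reason a fitted scaling relation is useful for extrapolation.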

The uncovered scale-free nature of the link distributions indicates that collective phenomena
play an unsuspected role in the development of the www [8], requiring us to look
beyond the traditional random graph models. A better understanding of the web
topology, aided by modeling efforts, is crucial in developing search algorithms or designing
strategies for making information widely accessible on the www. The good news is that, due 
to the surprisingly small diameter of the web, all that information is just a few clicks away. 

FIG. 1. The distribution of (a) outgoing links (URLs found on an HTML document) and
(b) incoming links (URLs pointing to a certain HTML document). The data were obtained from
the complete map of the nd.edu domain, which contains 325,729 documents and 1,469,680 links.
The dotted lines in (a) and (b) represent the analytical fits we used as input distributions in
constructing the topological model of the www, the tail of the distributions following $P(k) \sim k^{-\gamma}$,
with $\gamma_{\mathrm{out}} = 2.45$ and $\gamma_{\mathrm{in}} = 2.1$. (c) Average of the shortest path between two documents as a
function of the system size, as predicted by the model. As a check of the validity of our predictions,
we have determined $d$ for documents in the domain nd.edu. The measured $\langle d_{\mathrm{nd.edu}} \rangle = 11.2$ agrees
well with the prediction $\langle d_{3 \times 10^5} \rangle = 11.6$ obtained from our model. To show that the power-law tail
of $P(k)$ is a universal feature of the www, in the inset we show $P_{\mathrm{out}}(k)$ obtained by starting from
whitehouse.gov (squares), yahoo.com (upward triangles) and snu.ac.kr (downward triangles). The
slope of the dashed line is $\gamma_{\mathrm{out}} = 2.45$, as obtained from nd.edu in (a).
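
The caption quotes tail exponents but does not say how the fits were obtained. One simple possibility, shown here only as an illustrative sketch and not as the authors' procedure, is a least-squares straight-line fit of $\log P(k)$ against $\log k$ over the tail. The function name `fit_power_law_exponent`, the cutoff `k_min`, and the synthetic input are all hypothetical.

```python
import math


def fit_power_law_exponent(pk, k_min=1):
    """Estimate gamma in P(k) ~ k^(-gamma) from a dict {k: P(k)}.

    Fits a straight line to log10 P(k) versus log10 k by ordinary least
    squares, using only points with k >= k_min and P(k) > 0, and returns
    the negated slope. Needs at least two distinct k values in the tail.
    """
    points = [
        (math.log10(k), math.log10(p))
        for k, p in pk.items()
        if k >= k_min and p > 0
    ]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


# Synthetic, noise-free input with a known exponent (not measured data).
synthetic_pk = {k: k ** -2.45 for k in range(1, 1001)}
gamma = fit_power_law_exponent(synthetic_pk, k_min=10)
print(round(gamma, 2))  # 2.45
```

On measured distributions the sparse, noisy tail can bias a plain log-log regression, so the recovered exponent generally depends on the choice of `k_min` and on any binning applied beforehand.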